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Abstract —Channel estimation and precoding in hybrid analog- 
digital millimeter-wave (mmWave) MIMO systems is a fnnda- 
mental problem that has yet to be addressed, before any of the 
promised gains can be harnessed. For that matter, we propose a 
method (based on the well-known Arnold! iteration) exploiting 
channel reciprocity in TDD systems and the sparsity of the 
channel’s eigenmodes, to estimate the right (resp. left) singnlar 
subspaces of the channel, at the BS (resp. MS). We first describe 
the algorithm in the context of conventional MIMO systems, and 
derive bounds on the estimation error in the presence of distor¬ 
tions at both BS and MS. We later identify obstacles that hinder 
the application of such an algorithm to the hybrid analog-digital 
architecture, and address them individually. In view of fulfilling 
the constraints imposed by the hybrid analog-digital architecture, 
we further propose an iterative algorithm for subspace decompo¬ 
sition, whereby the above estimated subspaces, are approximated 
by a cascade of analog and digital precoder/comblner. Finally, 
we evaluate the performance of onr scheme against the perfect 
CSI, fully digital case (i.e., an equivalent conventional MIMO 
system), and conclude that similar performance can be achieved, 
especially at medium-to-high SNR (where the performance gap 
is less than 5%), however, with a drastically lower number of RF 
chains (~ 4 to 8 times less). 

Keywords—Millimeter wave MIMO systems, sparse channel esti¬ 
mation, hybrid architecture, hybrid precoding, subspace decompo¬ 
sition, Arnoldi iteration, subspace estimation, echo-based channel 
estimation. 


I. Introduction 

With the global volume of mobile data expected to increase 
by an order of magnitude between 2013 and 2019, and the 
volume corresponding to mobile devices outweighing that 
of all other devices Q, mobile network operators have the 
monumental task of meeting this exponentially increasing 
demand. Given that spectrum is a scarce and precious resource, 
future communication systems have to exhibit unparalleled 
spectral efficiency. Though earlier results date back to Q, 
1^, communication systems in the millimeter wave (mmWave) 
spectrum have been receiving growing interest over the past 
years. mmWave communication systems have the distinct 
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advantage of exploiting the huge amounts of unused (and 
possibly unlicensed) spectrum in those bands - around 200 
times more than conventional cellular systems. Moreover, 
the corresponding antennae size and spacing become small 
enough, such that tens-to-hundreds of antennas can be fitted 
on conventional hand-held devices, thereby enabling gigabit- 
per-second communication. 

However, the large number of radio frequency (RF) chains 
required to drive the increasing number of antennas, inevitably 
incurs a tremendous increase in power consumption (namely 
by the analog-to-digital converters), as well as added hardware 
cost. One elegant and promising solution to remedy this inher¬ 
ent problem is to offload part of the precoding/processing to the 
analog domain, via analog precoding (resp.combining), i.e., a 
network of phase shifters to linearly process the signal at the 
the base station (BS) (resp. mobile station (MS). This so-called 
problem of analog and digital co-design for beamforming and 
precoding in low-frequency regime was first investigated in 
0 , 0 - This architecture was later studied within the context 
of higher frequency (mmWave) systems in 0-0 - under 
the name of hybrid precoding/architecture - for the precoding 
problem. A similar setup for the case of beamforming was 
considered in a-oi- 

However, several fundamental challenges have to resolved 
before any of the promised gains can be harnessed, namely, 
estimating the (large) mmWave channel, and designing the ana¬ 
log/digital precoders and combiners accordingly. We underline 
the fact that classical training schemes developed for Multiple- 
input Multiple-output (MIMO) systems are not applicable 
for that particular case. Moreover, note that our proposed 
technique encompasses both beamforming and precoding, i.e., 
it does not depend on the number of streams. 

After a series of approximations to the mutual information, 
and taking into account precoding (excluding the receive com¬ 
biners), 0 derived an optimality condition relating the analog 
and digital precoders to the optimal unconstrained precoder 
(i.e., the right singular vectors of the channel), by assuming 
full channel state information (CSI) at both the BS and MS. 
This assumption was later relaxed in Q where an algorithm 
for estimating the dominant propagation paths was proposed, 
based on the previously proposed concept of hierarchical 
codebooks sounding in Hg, |TT). However, the algorithm 
requires a priori knowledge of the number of propagation paths 
(i.e. the propagation environment), its performance is affected 
by the sparsity level of the channel, and exhibits relatively 
elevated complexity. Finally, it appears rather inefficient to 
estimate the entire channel, while only a few eigenmodes 
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are needed for transmission: this is particularly relevant in 
mmWave MIMO channels, since the majority of eigenmodes 
have negligible power. 

The approach we present here attempts to address the 
above limitations. The proposed algorithm is based on the 
well known Arnoldi Iteration, exploits channel reciprocity 
inherent in Time-Division Duplexing (TDD) MIMO systems 
to gradually build an orthonormal basis for the corresponding 
Krylov subspace, and directly estimates the dominant left / 
right singular modes of the channel, rather than the entire 
channel. We then propose an iterative method for subspace 
decomposition, to approximate the estimated right (resp. left) 
singular subspace by a cascade of analog and digital precoder 
(resp. combiner), while taking into account the hardware con¬ 
straints of this so-called hybrid analog-digital architecture. The 
subspace estimation (SE) algorithm is based on BS-initiated 
echoing, whereby the BS sends along some beamforming 
vector, and the MS echoes its received signal back to the BS 
(using amplify-and-forward), thereby enabling the BS to obtain 
an estimate of the effective uplink-downlink channel. We hrst 
detail the algorithm in the context of conventional MIMO, 
taking into account distortions in the the system (e.g., noise, 
or other disturbances), derive bounds on the estimation error, 
and highlight its desirable features. We then adapt its structure, 
to ht the many operational constraints dictated by the hybrid 
analog-digital architecture. While we feel that aspects such as 
complexity, overhead and numerical stability are best left for 
future works, we do shed light on each of them. Although the 
main results of the paper were earlier presented in p2) , we 
provide in this work an in-depth look at our proposed methods, 
and derive several performance results. 

In the following, we use bold upper-case letters to denote 
matrices, and bold lower-case letters denote vectors. Fur¬ 
thermore, for a given matrix A, [A\i.,j denotes the matrix 
formed by taking columns i to j, of A, tr(A) denotes its 
trace, ||A|||r. its Frobenius norm, |A| its determinant, A^ its 
conjugate transpose. [A]ij = Uij denotes element {i,j) of 
A, Oi the ith of column A, and [a]i = element i in 
vector a. [AJsj;, and [A]ij represent the matrix formed by 
the strictly lower and upper triangular matrix of a square 
matrix A, respectively. denotes the n x n identity matrix, 
diag(x) is a diagonal matrix with elements of x on its 
diagonal, the real part of x, a^axiU] / crmin[i/[ the 

maximum/minimum singular value of U. Moreover, U — 
qr((7) refers to the semi-unitary matrix returned by the QR 
algorithm, with U^U = I. Finally, we let {n} = {1,..., n}, and 
Sp,, = {xe CPX9 I \x,f = 1/^ , y{t,k) G {p} X {q}}. 

II. System Model 

A. Signal Model 

Assume a single user MIMO system with M and N anten¬ 
nas at the BS and MS, respectively, where each is equipped 
with r RF chains, and sends d independent data streams (where 
we assume that d < r < min(M, N)). The downlink (DL) 
received signal is given by 

=HFGx^*'>+n^^'> ( 1 ) 



Fig. 1: Hybrid Analog-Digital MIMO system architecture 


where H G is the complex channel - assumed to be 

slowly block-fading, F G is the analog precoder, G G 

^rxd jjjg (jjgjfaj precoder, the TV-dimensional signal at the 
MS antennas, is the d-dimensional transmit signal with 
covariance matrix E[x^'^^x^*^^] = Id and is the AWGN 
noise at the MS, with Note that 

and subscripts/superscripts denote quantities at the BS and 
MS, respectively. Both the analog precoder and combiner are 
constrained to have constant modulus elements (since the latter 
represent phase shifters), i.e., F G Sm.t and W G 
(also referred to as the constant-modulus or constant-envelope 
constraint). We adopt a total power constraint on the effective 
precoder, i.e., ||i^G|||, < d, a widespread one in the hybrid 
analog-digital precoding literature §, 0. With that in mind, 
the received signal after hltering in the DL is given as, 

i = = U^W^HFGx^*^ + (2) 


where W G and U G are the analog and digital 

combiners, respectiveljQ We also assume a TDD system, 
where channel reciprocity holds. Finally, we denote the SVD 
of H as. 





■ 
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— -f #2S2r2 (3) 


where Ti G and $i G are semi-unitary, and 

Si = diag((Ti,..., CTd) is diagonal with the d largest singular 
values of H (in decreasing order). 


B. Motivation 


Keeping in line with previous work in that area, our aim is 
to design the precoders and combiners as follows. 


(4) 


f min 

I|ri--FG||| 

= Jf, G 

\s. t. 1 

|FG|||,<c;, FgS, 

f min 

11^1-will. 

— Jw, u 

Is. t. 

W G Siv^d 


The latter design criterion has been quite prevalent in earlier 
works relating to the hybrid analog-digital architecture, and 
applied rather successfully in 0, 0, flBj , p4| . After a series 
of approximations to the mutual information in 0, it was 
shown that the optimal precoders, F,G, are formulated in 


* Similarly, exploiting channel reciprocity, the uplink received signal is given 
by yd) = H^WUxd) + where is the M-dimensional signal at the 
BS andn^^^ is the AWGN noise at the BS, such that Jm- 
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exactly the same fashion as above (though their formulation 
did not include receive combining). 

Moreover, we use the following expression as a performance 
metric (i.e., the “user-rate” corresponding to a given choice of 
precoders and combiners). 


R = log 2 




(5) 


where HFG, = SNR is the signal-to-noise 

^ir) 

ratio. Moreover we assume, for simplicity, that uniform power 
allocation is performed (no waterfilling), keeping in mind that 
a power allocation matrix A can be easily incorporated in 
the expression. Although not directly optimized, the above 
expression was used in |Q, within the context of hybrid analog- 
digital precoding. As we will discuss below, the value of 
the expression in Q is related to achievable rates over the 
considered hybrid analog-digital MIMO link; in particular R 
becomes an achievable rate in the scenario that both the BS 
and MS are provided perfect knowledge of H. 

In a nutshell, Q boils down to finding FG (resp. WU) 
that “best” approximate Fi (resp. $i). Moreover, if there exists 
optimal precoders and combiners that make the distances in Q 
zero, then they must satisfy 

F*G* = ri, 


We denote by R* the resulting “user-rate” that is obtained by 
plugging in the above precoders/combiners in (|^. Then R* 
can be expressed as, 

A r{F\G\W*,U*) = log 2 I id -f SNR ( 6 ) 
Following the above discussion on the achievability of R, 
R* is the maximum achievable rate over the precoders and 
combiners, when H is known to both BS and MS. We 
underline the fact that R in depends on the subspace 
spanned by the precoders / combiners, rather than the Eu¬ 
clidean distance between the right/left dominant subspace and 
the precoder/combiner, i.e., However, optimizing metrics 
that involve span or chordal distances, is not straightforward. 
We thus emphasize that attempts at directly maximizing R 
in are outside the scope of this work: rather, the focus 
is put on proposing mechanisms for subspace estimation and 
decomposition, and analyzing their performance. 

Moreover, since we assume that no channel information is 
available at neither the BS, nor the MS, our aim is firstly to 
obtain an estimate of the subspaces in question, i.e. 
at the MS, and fi « Fi at the BS. We then propose meth¬ 
ods that optimize the precoders and combiners to accurately 
approximate the estimated subspaces, by providing means to 
solve problems such as ||fi—FG|||, and ||$i—(while 
taking into consideration the constraints inherent to the hybrid 
analog-digital architecture). 


III. Eigenvalue Algorithms and Subspace 
Estimation 

A. Subspace Estimation vs. Channel Estimation 

The aim of subspace estimation (SE) methods in MIMO sys¬ 
tems is to estimate a predetermined low-dimensional subspace 
of the channel, required for transmission. We illustrate this in 


the context of conventional MIMO systems, i.e., where pre¬ 
coders/combiners are fully digital. Eor the sake of exposition, 
we start with a simple toy example, where noiseless single¬ 
stream transmission is assumed (and ignoring any physical 
constraints). The BS selects a random unit-norm beamforming 
vector, pi, and then sends pix'^f where = 1. The 

received signal, qi — Hpi, is echoed back to the BS (in 
effect, this implies that the signal is complex conjugated before 
being sent), in an Amplify-and-Eorward (A-E) like fashion]^ 
Then, exploiting channel reciprocity, the received signal at 
the BS is first normalized, i.e., p 2 — qi/\\H^qi \\2 = 

Hpi/\\H^Hpi\\ 2 , and then echoed back to the MS. This 
simple procedure is done iteratively, and the resulting se¬ 
quences {pi} at the BS, and {g;} at the MS, are defined as 
follows, 

Pi+i=H^Hpi/\\H^Hpi\\ 2 -, qi+i=Hpi (7) 
It was noted in p3| that using the Power Method (PM), one 
can show that as I —)■ oo, pi —>■ 71 and qi —>■ cri^i, implying 
that this seemingly simple “ad-hoc” procedure will converge 
to the maximum eigenmode transmission. The authors of HD 
also generalized the latter method to multistream transmission, 
i.e., by estimating Fi and $ 1 , using the Orthogonal/Subspace 
Iteration (which was dubbed Two-way QR (TQR) in ng. 

HD). 

We note that SE schemes such as the ones described 
above, offer the following distinct advantage over classical 
pilot-based channel estimation: in spite of the large number 
of transmit and receive antennas, SE methods can estimate 
the dominant left/right singular subspaces with a relatively 
low communication overhead, when the latter have small 
dimension (relative to the channel dimensions). Consequently, 
subspace estimation is much more efficient than channel esti¬ 
mation, especially in large low-rank MIMO systems such as 
mmWave channels (because the latter estimates the dominant 
low-dimensional subspace instead of the whole channel). Eor 
the reason above, our proposed algorithm falls under the 
umbrella of SE methods. We first describe this algorithm in 
the context of “classical” MIMO systems, and later adapt it to 
the hybrid analog-digital architecture. 


B. Amoldi Iteration for Subspace Estimation 


Despite the fact that Krylov subspace methods (such as 
the Arnold! and Lanczos Iterations for symmetric matrices) 
are among the most common methods for eigenvalue prob¬ 
lems ini, their use in the area of channel/subspace estimation 
is limited to equalization for doubly selective OEDM chan¬ 
nels d, and channel estimation in CDMA systems d- 
Algorithms falling into that category iteratively build a basis 
for the Krylov subspace, /C™ = span{x,Ax, ...,A"^~^x}, one 
vector at a time. We use one of many variants of the so-called 
Arnoldi Iteration/Procedure, and a simplihed version of the 
latter is shown in Table (as presented in |[20|). The algorithm 
returns Qm = fei) • ■ • ,9m] G and an upper Hessenberg 

matrix G ^mxm^ 


QlAQm=T^, QlQm^I 


^This mechanism for MIMO subspace estimation, where the MS echoes 
back the transmitted signal using A-F, was first reported in (0 
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Set m {m < M); qi = random unit-norm ; Q = [gi] 
for Z = 1,2 ,m do 
l.api =Aqi 

1. b tk,i = q\ph fc = 1,... ,Z 

2 . n =pi — J2k=i ^k,iqk 

3. ti+i^i = ||ri ||2 ; if (Zz+i,/ = 0) stop 

4. Q = [Q, qi+i = ri/ti+ij] 

end for 

TABLE I: Arnold! Procedure 


It can be shown that the algorithm iteratively builds 
Qm, an orthonormal basis for ZC"* (when roundoff er¬ 
rors are neglected), and that Ql^AQ^^ = T^. We then 
say that the eigenvalues/eigenvectors of are called 

Ritz eigenvalues/eigenvectors, and approximate the eigenval¬ 
ues/eigenvectors of A. The main idea behind processes such 
as the Arnold! (and Lanczos) is to find the dominant eigenpairs 
of A, by finding the eigenpairs of Tm- 

We note that the Arnold! algorithm is a generalization of 
the Lanczos algorithm for the non-symmetric case, i.e., the 
latter is specifically tailored for cases where A A 0 (this is 
clearly the case in this work, since A = H^H ). This being 
said, the reason for not using the Lanczos iteration is that in 
practice, noise that is inherent to the echoing process, makes 
the Lanczos algorithm not applicable; namely, the requirement 
that Tm is tridiagonal, is violated. 

Our goal in this section is to first apply the above algorithm 
to estimate the d largest eigenvectors of A = H^H at the 
BS (which are exactly Ti), by implementing a distributed 
version of the Arnoldi process, that exploits the channel 
reciprocity inherent to TDD systems. Moreover, we extend the 
original formulation of the algorithm to incorporate a distortion 
variable (representing noise, or other distortions, as will be 
done later). 

It becomes clear at this stage, that the BS requires knowl¬ 
edge of the sequence {H^ Hqi}'^^, needed for the matrix- 
vector product in step 1 (Table |^; the latter can be accom¬ 
plished by obtaining an estimate pi, of H^Hqi, I G {m}. 
Without any explicit CSI at neither the BS nor the MS, 
we exploit the reciprocity of the medium to obtain such an 
estimate, via BS-initiated echoing: the BS sends qi over the 
DL channel, the MS echoes back the received signal in an 
A-F like fashion, over the uplink (UL) channel (following the 
process proposed in pi] , and detailed in Sect. III-A| i, i.e., 

DL : Si = Hqi + 

UL: pi^ HUi -b = H^Hqi + 

= H^Hqi+wi ( 8 ) 


(t) (r) 

where s; is the received signal in the DL, lu) ' and tu) ’ are 
distortions at the BS and MS, respectively (representing noise 
for example). 

After the echoing phase, the BS has an estimate, pi, of 
H^Hqi, as seen from ([^. The remainder of the algorithm 
follows the conventional Arnold! Iteration, and is shown in 
the Subspace Estimation using Arnoldi (SE-ARN) procedure 


procedure fi,Si = SE-ARN {H, d) 

Set m (m < M); Random unit-norm q; Q = [gi] 
for Z = 1,2, ...,m do 

// BS-initiated echoing: estimate H^Hqi 
La si—Hqi+w^f^ 

Lb Pi — Si L 
// Gram-Schmidt orthogonalization 
2.a tk,i=q\pi ,Vfc = l,...,Z 
2.b ri=pi-Yfk^^qktk,i 

2. C ti+ij = \\ri\\2 
// Update Q 

3. a Q = [Q, qi+i = ri/L+i^i] 

end for 

// Compute r 1 

Tm = 0A0 

fi = qr(Q ^ei:d) 

end procedure 

TABLE II: Subspace Estimation using Arnoldi Iteration (SE- 
ARN) 


(Table 0 . In addition to Tm at the output of the algorithm, 
we define the matrices, Tm, Wm and Dm, as follows, 
fqjH^IIqi, if Z < m, Vi < I 

= < |jri|| 2 , if Z < TO, i = Z -b 1 
[O, otherwise 

Wm = [Wi,...,Wm\, Em = [QlnWm]sL (9) 
where ri is given in Step 2.b (Table 0 . Note that similarly 
to the conventional Arnoldi Iteration, Tm is an the upper 
Hessenberg matrix. It then follows from the above definitions 
that 

Tm — Tm + [QmWm]u- (10) 

This can be easily verified by plugging in Step Lb into 2. a in 
Table M 

At the output of the SE-ARN procedure, the dominant eigen¬ 
pairs of H^H are approximated by those of Tm as follows. 
Let Tm = 0 A 0 be eigenvalue decomposition of Tm, 
where 0 is the (possibly non-orthonormal) set of eigenvectors. 
Then, it can easily be shown that Fi = qr(Qrn[0]i:d) are the 
Ritz eigenvectors of H^H, where [0]i:d has as columns the 
eigenvectors of Tm associated with the d largest eigenvalues 
(in magnitude)]^ Note that the latter procedure results in the 
BS obtaining Fi, and consequently Si, using the so-called BS- 
initiated echoing. This same procedure can be applied using 
MS-initiated echoing, to estimate $1 (i.e., the eigenvectors of 
HW), at the MS. 


^Note that, to be exact, the Ritz eigenvectors do not contain any estimation 
noise. That being said, we stick to this nomenclature, with a slight abuse of 
definition. Moreover, Si, the Ritz eigenvalues of come for free once 

the Ritz eigenvectors are obtained (Table [n|. 
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C. Perturbation Analysis 

In what follows, we extend some of the known properties 
of the conventional Arnold! iteration, to account for the esti¬ 
mation error, emanating from the distortion variable. 

Lemma 1. For the output of the Arnoldi process the following 
holds, 

(PI): 

QlAQ^=fm-E^ = Cm, ( 11 ) 

where Cm = SmAmSf^ is such that [AJ^ ^ > 0 and 
Q~1 ^ ot 


(P2) : Let be any eigenpair of Cm- Then 

(A) , 6 /) = QmSi ) IS an approximate Ritz eigenpair 
for A. Furthermore, the approximation error is such that, 

+ \\Im - QmQlWlWWmWl. 

( 12 ) 


- A) 


where = {^m]ra+l,m\[s^r'^]m\Y- 

(P3) : As m ^ M, —>■ 0, implying that 

the eigenpairs of Cm perfectly approximate the eigenpairs of 
A(up to round-off errors). 


Proof: The proof is shown in Appendix ■ 

We underline the fact that if the distortion variable Wm is 
zero, the above derivations reduce to the well-known results 
on the Arnoldi process p0| Sect. 6.2]. Lemma establishes 
the fact that each eigenpair of Cm, is associated 

with one eigenpair of aR 

Thus, one might be tempted to conclude at this point, that by 
computing the eigenpairs of Cm, one can perfectly estimate the 
eigenpairs of A, despite the presence of the distortion variable 
Wm- However, the fact remains that Cm — T’m — Em cannot 
be computed, mainly because Em is not known to the BS. 
As a result, Tm ot the output of the Arnoldi process will be 
used instead to approximate the eigenpairs of A. Now that we 
established that the eigenpairs of Cm approximate that of A, 
the natural question is how close are the eigenpairs ofTm, to 
that of Cm- 

For that purpose, we first show the following. 

Cm + QlWm = {fm " Em) + QIWm 

= fm + {QlWm-[QlWm]sL) 

= fm + [QlWm]u=Tm (13) 

where the first equality follows from the definition of Cm, and 
the last one from ( [TOl l. Thus Cm can be viewed as the matrix 
in question, and Pm — Q\nWm a perturbation matrix. We 
then apply the Bauer-Fike Theorem p2| Th. 7.2.2] to bound 
the difference in eigenvalues. 


Lemma 2. Every eigenvalue A of Tm = Cm + Pm satisfies 
where A is an eigenvalue of Cm- 


■^Though (P3) in Lemma In implies that the error in approximating the 
eigenpairs of A with those of C7m vanishes as m —^ M, our simulations will 
later show that very good approximations can be obtained, even for m <IC M. 


Proof: Refer to Appendix [B] ■ 

Summarizing thus far, Lernnia [B showed that the eigenpairs 
of A can be approximated by the eigenvalues of Cm, with 
arbitrarily small error. However, since the latter is not available, 
we approximate the eigenpairs of Cm (and consequently of A) 
by those of Tm, the upper Hessenberg matrix at the output of 
the Arnoldi process. Finally, Lemmaj^established the fact that 
this approximation error, for the eigenvalues, is upper bounded 
by the magnitude of the perturbation itself. We note that the 
relevant “error-metric” here is the distance between the true 
subspace Fi, and estimated subspace f i ex Qm^i.d (Table [n|l. 
This does suggest that the estimation error is dependent on 
0i:d, the eigenvectors of Tm- However, performing a similar 
sensitivity analysis on the eigenvectors is much more involved, 
since the sensitivity of eigenvectors generally depends on the 
clustering of eigenvalues. 

IV. Hybrid Analog-Digital Precoding for 
mmWaveMIMO systems 

In this section we turn our attention to applying the above 
framework for subspace estimation and precoding, to the hy¬ 
brid analog-digital architecture. As this section will gradually 
reveal, several obstacles have to be overcome for that matter. 
We start by presenting some preliminaries that will be used 
throughout this section. 

A. Preliminaries: Subspace Decomposition 

We will limit our discussion to the digital and analog 
precoder, keeping in mind that the same applies to the digital 
and analog combiner. In conventional MIMO systems, the 
estimates of the right and left singular subspace, Fi and $i, 
obtained using SE-ARN, can directly be used to diagonalize 
the channel. However, the hybrid analog-digital architecture 
entails a cascade of analog and digital precoder. Thus, Fi 
has to be decomposed into FG (hence the term Subspace 
Decomposition (SD)), as follows, 

Tmin /io(i=’,G) = ||fi-FG||?, 

js.’t. hi{F,G)^\\FG\\l<d (14) 

[ Eg SM,d 

We underline the fact that the authors in 0 arrived to the 
same formulation as 0 , and proposed a variation on the 
well-known Orthogonal Matching Pursuit (OMP), to tackle it. 
The same framework was recently extended in 03 to relax 
the need for dictionaries based on the array response matrix. 
An alternate decomposition was proposed by | |23| , where the 
optimization metric is the user rate. Both works were published 
after the initial submission of our paper. 

Within the context of hybrid precoding, the authors in 0 
showed that there exists (non-unique) F G SM,r,9 G 
such that Fi ^Fg, if and only if r > 2. This was extended 
in 110 where it was shown that there exists F G SM,r,G G 
C''^“ such that fi = FG, if r > 2d. We note that for 
such cases, the cost function in (HI is zero, and we refer 
to such cases as optimal decomposition -whose performance 
we evaluate in the numerical results section: although the 
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aforementioned schemes use all the available RF chains for 
the decomposition (and our decomposition uses a subset of the 
RF chains), the sum-rate performance is actually the same. 

To a certain extent, d is reminiscent of formulations 
arising from areas such as blind source separation, (sparse) 
dictionary learning, and vector quantization p4) , | |25| . Though 
there is a battery of algorithms and techniques that have been 
developed to tackle such problems, the additional hardware 
constraint on F, i.e. F G SM,r makes the use of such tools 
not possible. As a result, we will resort to developing our own 
algorithm. In spite of the non-convex and non-separable nature 
of the above quadratically-constrained quadratic program, we 
propose an iterative method that attempts to determine an 
approximate solution. 

1) Block Coordinate Descent for Subspace Decomposition: 
In this part, we further assume that only d of the r available 
RF chains are used, i.e., F G and G G 

(the reason for that will become clear later in this section). 
The coupled nature of the objective and constraints in ( [T4l l 
suggests a Block Coordinate Descent (BCD) approach. The 
main challenges arise from the coupled nature of the variables 
in the constraint (since the latter makes convergence claims of 
BCD, not possible p 6 )), and from the hardware constraint on 
F. We will show that a BCD approach implicitly enforces the 
power constraint in ([T^, and consequently the latter can be 
dropped without changing the problem. 

Our approach consists in relaxing the hardware constraint 
on F, and then applying a Block Coordinate Descent (BCD) 
approach to alternately optimize F and G (while projecting 
each of the obtained solutions for F on S). For that matter, 
we first define the Euclidean projection on the set S in the 
following proposition. 

Proposition 1. Let X G be defined as = 

\xi,k\ V(i,fc), and 

y = n 5 [X] = argmin \\U - X\\% 

U£SM,d 

denote its (unique) Euclidean projection on the set Sm d- Then 

= (I/Tm) 

Proof: The proof is straightforward variation on previous 
results such as ■ 

The latter result implies that given an arbitrary F, finding 
the closest point to F, lying in SM,d simply reduces to setting 
the magnitude of each element in F, to l/y /M. 

Neglecting the constraint on F in ( [T4l l, one can indeed 
show that for fixed G (resp. F), the resulting subproblem 
is convex in F (resp. G). With this in mind, our aim is 
to produce a sequence of updates, {Fk,Gk\k such that the 
sequence Gfc)}fe is non-increasing (keeping in mind 

that monotonicity cannot be shown due to the coupling in the 
power constraint). Thus, given Gk, each of the updates, Fk+i 
and Gk+i, are defined as as follows, 

(Jl) Fk+i=mm ho{F) = \\ti-FGk\\% 

F 

(J2) Gfc+i^min /io(G) = ||f i - Ffe+iG||^ 

G 

Both (Jl) and (J2) are instances of a non-homogeneous (un¬ 
constrained) convex quadratically-constrained quadratic pro- 


procedure [F, G] = BCD-SD (fi) 

Start with arbitrary Fq 
for fc = 0,1 , 2 , ... do 

Gk+i ^ {FlFk)-^FlT, 

F k+l n^PiG 

fc+l (G/c-riGj’.+i) 

end for 
end procedure 

TABLE III: Block Coordinate Descent for Subspace Decom¬ 
position (BCD-SD) 


gramming (QCQP) that can easily be solved (globally) by 
finding stationary points of their respective cost functions, to 
yield, 

Fk+i=riGliGkGl)-^ (15) 

Gfc+i = (16) 

We note that our earlier assumption that only d of the RF 
chains are used here (i.e. G is square), guarantees that, (GiGj) 
in ( [Thl l is invertible, almost surely: in fact, our numerical results 
show that the incurred performance loss is quite negligible. 

Moreover, note that the solution in ( [T5| l does not necessarily 
satisfy the hardware constraint on F. Thus, the result of 
Proposition can be used to compute the projection of F 
on SM,d- To prove our earlier observation that the optimal 
updates Fk+i and G^+i satisfy the power constraint in ([T^, 
we plug ( [T6| ) into the following (dropping all subscripts for 
simplicity), 

IjFGIII =tr 

I 


< ti- {{F^F)-^F^F) tr (f ifl) = d (17) 

where we assumed that ||ri|||, = 1 w.l.o.g., and used the 
fact that \i{AB) < tr(A)tr(F) for A,B > 0. Note that 
the above relation holds for any arbitrary full-rank F, and 
thus, the power constraint is satisfied even after applying the 
projection step. The above shows that if BCD is used, then the 
power constraint in ( [l4| ) is always enforced. The corresponding 
method is termed Block Coordinate Descent for Subspace 


Decomposition (BCD-SD), and is shown in Table III 


Remark 1. We underline the fact that due to the projection 
step, one cannot show that the sequence {ft,o(Ffe,G/t)}fe is 
non-increasing. Nevertheless, despite the fact that monotonic 
convergence of BCD-SD cannot be showed analytically, our 
simulations indicate that the latter is indeed the case, under 
normal operating conditions. 


Remark 2. It can be easily verified that the optimal F*,G* 
that maximize R in ( |5]l ar e such that ||F*G*|| = d. Though 
the optimal solution to |l4|l is not invariant to scaling, as far as 
the performance metric in (|^ is concerned, there in no loss in 
optimality in scaling the solution given by BCD-SD, to fulfill 
the power constraint with equality. 
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2) One-dimensional case: Note that echoing (e.g., our pro¬ 
posed mechanism in Table |I^ relies on the BS being able to 
send any vector qi, to be echoed back by the MS. For the 
hybrid analog-digital architecture, this translates into the BS 
being able to (accurately) approximate qi by figi, where // 
is a vector, gi is a scalar. As a result, subspace decomposition 
for the one-dimensional case is of great interest here. When 
d = 1, (|T4li reduces to the problem below. 


Lemma 3. Consider the single dimension SD problem, 

[min ho{f, g) = ||/||i g"^ - 2glk{f^ 

{f, 9 (18) 

[s. t. [f]i = 1 /s/m 

where g G K-i- and Then the problem admits 

a globally optimum solution given by, [f*]i = 1 /s/M /^i 

and g* = \\Ji\\i/s/M 


Proof: Refer to Appendix ■ 

Similarly to it can be verified that a power constraint is 
indeed implicitly verified. Moreover, the approximation error 
e = — /(? is such that, 

[e]i = - ||7illi/M|e^'®% Vz e {M}. (19) 


We note that when considering the effective beamformer, 
i.e., fg, the solution given by Lemma [3 is to some extent 
reminiscent of equal gain transmission in |j^|, | [28) , in terms 
of the optimal phases. 

We recall that a similar hybrid beamforming setup was 
considered in Q where the authors optimize u,w,f,g, to 
maximize the SNR as well as the spectral efficiency. Although 
our formulation optimizes the same quantities, the optimization 
metric we consider, the subspace distance, is different. 

Note that the decomposition can be written in a simple form. 
Given a vector its globally optimal decomposition (from 
the perspective of ( [T4l l) is given as, 

7i«5t/l=(ll7illi/VM) n5[7i]. 

This can be generalized to obtain an alternate method to BCD- 
SD, by decomposing Ti, in a column-wise fashion. 


= (I/Tm) [n5[7i],--- ,n5[7j] diag(|l7i||i,--- ,||7rf|ji) 

( 20 ) 

3) Numerical Results: As mentioned earlier, d was for¬ 
mulated and solved in ||^, using a variation on the well-known 
Orthogonal Matching Pursuit (OMP), by recovering F’ in a 
greedy manner, then updating the estimate of G in a least 
squares sense. We thus compare its average performance with 
our proposed method, for a case where Fi G cALxd 
that M — 64, r = 10 (for several values of d). The curves 
are averaged over 500 random realizations of Fi (the latter 
are random unitary matrices). Moreover, we follow the same 
setup for OMP as that of 0^ namely, that the dictionary is 
designed based on the array response vectors (of size 256). 
The reason for the large performance gap in Fig. is that 
BCD-SD attempts to find a locally optimal solution to ( [T4l l 
(though this cannot be shown due to the coupled variables). 
Moreover, OMP is halted after r iterations, since it recovers 
the columns of F one at a time, whereas our proposed method 
runs until reaching a stable point. With that in mind, although 



Fig. 2; Average subspace distance ||Fi — FG\/p, for our 
proposed method and OMP 


OMP might perform better in terms of approximating the span 
of Fi, it is challenging to measure and optimize such metrics 
in practice. Moreover, we recall that in its original formulation 
in OMP is indeed formulated to solve the problem at hand 
(i.e. ([g), and thus the comparison seems fair. Interestingly, 
despite its extreme simplicity, the column-wise decomposition 
in ( |20l i offers a surprisingly good performance (as seen in 

FigW 


B. Echoing in the Hybrid Analog-Digital Architecture 

It is clear by now that the gist behind the schemes described 
in this work, is to obtain an estimate of {H^Hqi}'/f,^ at 
the BS, by exploiting channel reciprocity, using BS-initiated 
echoing described in 0- However, in the case of the hybrid 
analog-digital architecture, there are several issues that prevent 
the application of the latter procedure. Firstly, the digital 
beamforming vector qi needs to be approximated by a cascade 
of an alog a nd digital beamformer, using the decomposition in 
Sect. 


IV-A 


i.e., qi = fifi -\-ei, where ei is the approximation 
error given in Moreover, the BS-initiated echoing relies 
on the MS being able to amplify-and-forward its received 
signal; this is clearly not possible using the hybrid analog- 
digital architecture. In addition, neither the BS nor MS can 
digitally process the received signal at the antennas: only after 
the application the analog precoder/combiner (and possibly the 
digital precoder/combiner) can the baseband signal be digitally 
manipulated 0, m- 

With this in mind, we emulate the A-F step in BS-initiated 
echoing, 0, as follows, qi is decomposed into at the BS 
and sent over the DL. The MS linearly processes the received 
signal in the downlink, with the analog combiner, i.e., S; = 
W\{Hf igi), and same filter is used as the analog precoder, to 
process the transmit signal in the UL, i.e., WiSi. Finally, the 



















received signal at the BS is processed with the analog precoder, 
Fi. The resulting estimate, pi, at the BS is, 

Pi=F]WWiW]H{qi-ei) (21) 

Note that the above process is possible using the hybrid 
analog-digital architecture. Since noise is present in any up¬ 
link/downlink transmission, for clarity in what follows, we 
drop the noise-related terms from all equations. Needless to 
say, their effect is accounted for in the numerical results. It 
is clear from ( |2T| that pi is no longer a “good” estimate of 
for the reasons stated below. 

1. Analog-Processing Impairments (API): Processing the 
signal at the MS with the analog combiner/precoder 
Wi greatly distorts the singular values/vectors of the 
effective channel. Moreover, processing the received 
signal at the BS with the analog combiner Fi G 
implies that pi is now a low-dimensional observation 
of the desired M-dimensional quantity H^Hqi (since 
r < M). 

2. Decomposition-Induced Distortions (DID): The error 
from decomposing qi at the BS, ei, further distorts the 
estimate (as seen in (|2T|). 

The above impairments are a byproduct of shifting the burden 
of digital precoding, to the analog domain. In what follows, 
these impairments will individually be investigated and ad¬ 
dressed. 

1) Cancellation of Analog-Processing Impairments: Our 
proposed method for mitigating analog-processing impair¬ 
ments (API) relies on the simple idea of taking multiple 
measurements at both the BS and MS, and linearly combining 
them, such that WiW\ and FiF\ approximate an identity 
matrix. 

In the DL, qi is approximated by fiPi, and fipi is sent over 
the DL channe0 times (where Kr = N/r), each linearly 
processed with an analog combiner {Wi^k G to 

obtain the digital samples {s/ k\^fi (this process is shown in 
Table (1^). Moreover, the analog combiners are taken from 
the columns of a Discrete Fourier Transform (DFT) matrix, 
i.e, 

[Wl,i,...,Wl,KA=Dr, (22) 

where Dr G £,nxn ^ normalized N x N DFT matrix 
(i.e., where each column has unit norm and satisfies the unit- 
modulus constraint). The same analog combiners, {Wi^k}k, 
are used to linearly combine to form 5/ . We dub 

this procedure Repetition-Aided (RAID) Echoing, and the 
aforementioned DL phase, is shown in Table |IV| The resulting 
signal at the MS, s;, can be rewritten as, 

Hidfrgi) = dHfrgu (23) 

where equality follows from our earlier definition of {Wi^k]k 

in \22\. Note that the effect of processing the received signal 

^When sending hgi over the DL, we can use d RF chains, i.e., 

PiGi Id = [/;,■■ ■ ,/;] diag(g,, - ,gi) Id = dfigi 
thereby resulting in an array gain factor of d. Moreover, since we know 
from GZ} that WfigiWl < 1, indeed this transmission scheme satisfies the 
power constraint. We also make use of this observation in the UL sounding. 




Send 

times 


i 

- (figi) 



W, 


fV,, 


Combine 
Compute p, 



d. 


< ^ t 

i 

d 

_ ^I.Kt 



.W/ Uj 

t 


Send 
K, times 


Fig. 3: Repetition-aided (RAID) echoing for the hybrid analog- 
digital architecture 

with the analog combiner has been completely suppressed. 
Now, Si is normalized, and echoed back in the UL direction. 

A quite similar process is used in the UL: s/ is first decom¬ 
posed into Wiui, d RF chains are used to send it over the UL, 
Kt times (where Kt = M /r), and each observation is linearly 
processed with an analog combiner G 

The resulting digital samples again linearly 

combined with the same to obtain the desired 

estimate pi. Similar to the DL case, the analog combiners are 
taken from the columns of a Discrete Fourier Transform (DFT) 
matrix, i.e, [Fi^i,Fi^Kt] = Dt- The process for the UL is 
also shown in Table |IV| We combine its steps to rewrite pi as. 

Pi HHdwiUi) = dH^WiUi (24) 

\m=l / 

At the output of the RAID procedure, the BS has the 
following pi. 

Pi = dH^wiUi = dH\si - = dH^dHfigi - e^p) 

= d^H^Hqi - d^H^Hef^ - ( 25 ) 

Note that = Qi — fidi (resp. = Si — Wiui) is the error 
emanating from approximating qi (resp. si) at the BS (resp. 
MS), that we dub BS-side (resp. MS-side) decomposition- 
induced distortion (DID). It is quite insightful to compare p/ in 
the latter equation with ( [2l] i. We can clearly see that impair¬ 
ments originating from processing the received signals with 
both Wi and Fi, have completely been suppressed. In ( |25l l. 
Pi indeed is the desired estimate, i.e., H^Hqi, corrupted by 
distortions emanating from the BS-side decomposition, 
and the MS side decomposition, e) (both investigated later 
in the next subsection). Both UL and DL phases of he process 
are illustrated in Fig. and detailed in Table |iy] 

Remark 3. Note that employing this process reduces the 
hybrid analog-digital architecture into a conventional MIMO 
channel: any transmitted vector in the DL, (figi), can be 
received in a “MIMO-like” fashion, as seen from ( |23| , at a 
cost of Kr channel uses (the same holds for the UL, as seen 
from ( |24l l ). 

It can be seen from the above, that in the DL (resp. UL), d 
RF chains are active at the BS (resp. MS), while all r RF chains 
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// DL phase 

qi = fi9i + 

si,k = wl,^Hidfrgi), yk€{Kr} 

— Yl,k=l^ l,kSl,k 

II UL phase 

Si = wiui + e\ 

Zi,m = Fl^W{dwiui), Vto G {Kt} 

Pi — ^^m—l ^l,rnZl^rn 

TABLE IV; Repetition-Aided (RAID) echoing 


are used at the MS (resp. BS), to minimize the overhead. With 
this in mind, it can be seen that the associated overhead with 
each echoing, fl = (M + N)/r (channel uses), will decrease 
as more RF chains are used. 

2 ) Imperfect Compensation of Analog-Processing Impair¬ 
ments: Though the above method perfectly removes all ar¬ 
tifacts of analog processing, the overhead is proportional to 
(M -\-N)/r. A natural question is whether a similar result can 
still be achieved when and Dt are truncated matrices i.e. 
when Kr < N/r and Kt < M/r. Perfect cancellation of API 
relies on a careful choice of the analog precoder/combiner for 
each measurement, by picking {Wi^k}k=i {^i,m}m=i 
span all the columns of (square) DFT matrices. We investigate 
the effect of picking and Dt as truncated matrices, i.e. 
when Kr < N/r and Kt < M/r. Focusing our discussion 
on just analog precoders for brevity, we seek to find a (tall) 
matrix Dt G ?7 < 1, such that. 


min|lT?j/M - DfD 


Dt 


I 




s. t. Dt G S 


M, rjM 


(26) 


Due to the apparent difficulty of the problem, one can resort 
to stochastic optimization tools, e.g. simulated annealing; this 
approach is ideal for the design of Dt (and D^ as well), since 
it is completely independent of all parameters (except M, N 
and rj), and can thus be computed off-line and stored for later 
use. Then, the resulting overhead would be reduced to 17 = 
investigations along this line are outside the 
scope of this work, but we opted to include them briefly, for 
completeness. 


C. Proposed Algorithms 

Combining the results of the previous subsections, we can 
now formulate our algorithm for Subspace Estimation and 
Decomposition (SED) for the hybrid analog-digital architec¬ 
ture (shown in Algorithm [T]); estimates of the right / left 
singular subspaces, Fi and $i, can be obtained by using 
the SE-ARN procedure (Sect. 0), keeping in mind that the 
echoing phase (Steps l.a and Lb) is now replaced by the RAID 
echoing procedure (Table |IV] Then, the multi-dimensional 
subspace decomposition procedure, BCD-SD in Sect. |IV-A[ 
is then used to approximate each of the estimated singular 
spaces, by a cascade of analog and digital precoder/combiner. 
We highlight a desirable feature for the SED algorithm; the 


subspace estimation mechanism is totally decoupled from the 
subspace decomposition part, and thus any of the latter parts 
can be substituted, if desired. 


Algorithm 1 Subspace Estimation and Decomposition (SED) 
for Hybrid Analog-Digital Architecture 

// Estimate Fi and 
fi,Si = SE-ARN (H, d) 

$1 = SE-ARN (Tf^ dl 
11 Decompose Fi and 
[F,G] = BCD-SD (f 1, p) 

[W,U ]= BCD-SD ($1, p) 

Perform waterfllling on Si 


Note that previously proposed algorithms within this context 
such as the PM and TQR in |^, are no longer applicable here; 
indeed both rely on the MS being able to amplify-and-forward 
its received signal at the antennas - clearly this modus operandi 
cannot be supported by the hybrid analog-digital architecture. 
Interestingly, it is possible to apply elements from the RAID 
echoing structure that we developed, effectively modifying the 
original echoing structure of the latter schemes, and adapting 
them to the hybrid analog-digital architecture (as shown in 
Algorithm]^. 


Algorithm 2 Modifled Two-way QR (MTQR) for Hybrid 
Analog-Digital Architecture 

for I = 1,2, ..., I do 

// Decompose each column of X 

[X]n ~Jn9n, Vn G {d} (using Lemma 

X = [Jih ,••• , 'fd9d ] 

H Send X in DL, one column at a time 
Tfe = W^HX, Vfc G {Kr} 

Y ^Zfl.WkTk ;Y = qriY) 

11 Decompose ofY 

\Y]n ~ WnUn, Vu G {d} (using Lemma 

Y = [wiui ,■■■ , WdUd ] 

H Send Y in UL, one column at a time 
Sk^F^H^Y, ykG{Kt} 

Z = ■, X = qr(Z) 

end for 


Operationally, the proposed MTQR algorithm is the same 
as the Two-way QR (TQR) in p3) , whereby Fi and $i 
are obtained iteratively; as / —> oo, X —Fi (at BS) and 
y (at MS). At each iteration of the algorithm, the BS 

sends X in the downlink, and the QR algorithm is applied 
to the received signal. Then, the resulting signal is sent by 
the MS in the uplink, and the QR algorithm is applied at 
the BS to form Y. While TQR assumes fully digital MIMO 
transmission, our contribution is to apply the RAID scheme, 
to make the transmission compatible with the hybrid analog- 
digital systems. 
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D. Bounds on Eigenvalue Perturbation 

It can be clearly seen that the iterative nature of Algorithm^ 
makes the application of Lemma to quantify the impact of 
decomposition and approximation errors, not possible. On the 
other hand, for Algorithm the fact that each H^Hqi is only 


corrupted by two sources of DID, and e\'', makes the 
latter possible. With that in mind, we specialize the result 
of Sect. \ni-B\ and Lemma (developed for generic MIMO 
systems) to the case of Algorithm in the hybrid analog- 
digital architecture. We thus relate the eigenvalues of at 
the output of SE-ARN, to t he dom inant eigenvalues of Cm, 
and consequently of A (Sect III-B i. 




Corollary 1. Every eigenvalue A of Tm satisfies 
\X-X\<m\\H\\%{3+-^) 


where X in an eigenvalue ofC^, 


d\\H\\ 


Proof: Refer to Appendix ■ 

Moreover, recall that as m ^ M, X is an eigenvalue of A 
(Lemma [T]- P3). Thus, this result directly relates the eigenval¬ 
ues of Tm, to that of A: though this holds asymptotically in m, 
our simulations will show that good approximations can still 
be obtained, even for m ^ M. Note that we have ignored the 
effect of DID compensation, within the RAID echoing process, 
for convenience. As a result, the above bound is a “pessimistic” 
performance measure. 


E. Practical Implementation Aspects 

We evaluate the communication overhead of both schemes, 
in number of channel uses, keeping in mind that the actual 
overhead will be dominated by the latter. Algorithm [TJrequires 
Kt-\-Kr channel uses per iteration, to estimate Fi, and Kt-\-Kr 
to estimate $i, for a total of 

flsED = 2m (27) 

r 

m being the number of iterations for the Amoldi process. 
Letting / denote the number of iterations for MTQR, the 
number of channel uses required for Algorithm is, 

M + N 

^MTQR = dl - ^ - (28) 

It should be emphasized here that our main focus in this 
work is to investigate the principle of subspace estimation 
employing numerical techniques, and through simulations de¬ 
scribe the performance gain that can be expected by taking 
on such an approach. Hence, our major concern is not to 
investigate a stable and low-complexity technique that can be 
readily implemented in practice. We will, however, provide 
suggestions on what can be done to enhance the stability of 
the devised schemes, while admitting that many of the prob¬ 
lems connected with practical implementation of the proposed 
method are subject to further study. Generally, it is known 
that the Arnoldi (and Lanczos) algorithm may suffer from 
numerical stability issues. Though analytically speaking, the 
basis Qm is easily shown to be orthonormal, in practice, 
however, errors resulting from floating-point operations lead 
to a loss in orthogonality (the extent to which it happens is 


dependent on the application) | |20l Sec. 7.3]. Moreover, for our 
algorithm, noise inherent to the echoing process will further 
amplify this effect. One of the widely adopted fixes for this 
matter is the Implicitly Restarted Arnoldi algorithm | |M| Sec. 
7.3]. We did experiment with such an algorithm, and though 
it does enhance the numerical stability of the algorithm, the 
resulting overhead is increased by a large factor. This issue 
is critical for the SED algorithm (that employs the RAID 
echoing), since it renders real-world implementation quite im¬ 
practical. Moreover, there are many problems connected with 
practical implementations of the Restarted Arnoldi method, 
that are subject to further study. Other methods that might 
enhance the stability the Arnoldi Iteration, such as deflation 
techniques, have been reported in p9). 


F. Discussion 


We have presented an approach to maximizing the metric R 
defined in (|^. As mentioned earlier, the value of the objective 
function is in general not an achievable rate for our system. 
However, optimizing similar expressions related to achievable 
rates has been proved to give good results in previous work on 
transmission with partial CSI m- Since any rate achievable 
with partial CSI, cannot be larger than the corresponding rate 
achievable with perfect CSI, this criterion always provides an 
upper bound on the achievable rates in our system. Hence, in 
our approach, if the proposed algorithms result in values for 
R that are closing in on the perfect CSI upper bound, then the 
scheme is performing optimally (in the sense of achievable 
rates). 

With the above in mind, we use the following, as our 
performance metric in the simulations. 


R = log2 


Id + 


^(r) 

(29) 


In that sense, R is the ‘user rate’ that is based on the actual 
channel If, and the precoders / combiners that are in turn 
designed based on the estimated channel. 


V. Numerical Results 
A. Simulation Setup 

In this section, we numerically evaluate the performance of 
our algorithms, in the context of a single-user MIMO link. We 
adopt the prevalent physical representation of sparse mmWave 
channels adopted in the literature, e.g., 0’ where only L 

scatterers are assumed to contribute to the received signal - 
an inherent property of the poor scattering nature in mmWave 
channels. 



(r) (t) 

where xl and ' are angles of arrival at the MS, and angles 
of departure at the BS (AoA/AoD) of the path, respectively 
(both assumed to be uniform over [—7r/2, 7r/2]), fii is the 
complex gain of the path such that ^ CAf{0, 1), Vi. 
Linally, UriXi^'^) and are the array response vectors 
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at both the MS and BS, respectively. For simplicity, we will 
use uniform linear arrays (ULAs), where we assume that the 
inter-element spacing is equal to half of the wavelength. In 
what follows, we also assume that M/r = 8 and N/r = 4, 
i.e., as M,N increase, so does the number of RF chains. 

1) Benchmarks/Upper bounds: We use the Adaptive Chan¬ 
nel Estimation (ACE) method (Algorithm 2 in 0) as a 
benchmark, to estimate the mm Wave channel. It is based on 
sounding of hierarchical codebooks at the BS, feedback of 
the best codebook indexes by the MS, and finding the ana¬ 
log/digital precoders and combiners using OMP ||^. Moreover, 
the authors characterized the resulting communication over¬ 
head ^acei as a function of the codebook resolution. We used 
the corresponding MATLAB implementation that was provided 
by the authors. We adjust the number of iterations for both our 
proposed schemes and the codebook resolution of benchmark 
scheme, such that ^sed = ^mtqr = flo ~ ^ace- Note that 
we do not assume any quantization for phases of the RE filters. 
We also compare the performance of the algorithms against the 
“optimal performance”, R* in (|^, where full CSIT/CSIR is 
assumed, fully digital precoding is employed, and the optimal 
precoders are used. All curves are averaged over 500 channel 
realizations. 

Remark 4. Note that if one want to use “classical” pilot- 
based channel estimation to estimate the DL channel, i.e., a 
pilot sequence of minimum length M, then the same repetition- 
based framework that was used in RAID echoing, has to be 
used to cancel the effect of W from the effective channel 
estimate: it can be easily seen that the resulting total (both DL 
and UL) number of pilots slots would be 2MN/r‘^, thereby 
making the latter method infeasible. 

B. Performance Evaluation 

We start by investigating the performance of our schemes 
against the above benchmarks, for the case where M — 
128, N = 64, L = 3, and m = id, for two cases: d = l and 
d = 2 where the resulting overhead is flo = 72 and flo = 144 
channel uses, respectively. It can be seen from Eig. 1^ that 
both proposed schemes exhibit relatively similar performances, 
that are in turn very close to the optimal performance bound 
R* (especially above —10 dB). This indeed suggests that the 
multiplexing gain achieved by conventional MIMO systems 
can still be maintained in the hybrid analog-digital architecture, 
albeit at a much lower cost: the number of required RF chains 
can be drastically decreased, resulting in savings in terms of 
cost and power consumption. Moreover, we observe a sharp 
and significant performance gap between both our schemes and 
the benchmark from 0> over all SNR ranges (the gap being 
more significant in the low-SNR regime). We also evaluate 
the so-called optimal decomposition schemes 0.0 that can 
exactly decompose Fi into FG (discussed in Sec. IV). Thus, 
the curves labeled ’Optimal Decomp.’ refer to the case where 
the optimal decomposition is used in conjunction with SED. 
Eig 1^ clearly reveals that the ability to optimally decompose 
the estimated subspaces does not bring about additional gains. 
We note that the tiny mismatch between ’Optimal Decomp.’ 
and Algorithm 1 is due to simulation resolution. 



Eig. 4: Average sum-rate of proposed schemes {M = 
128, W = 64, d = 2, L = 3, TO = 6) 



Eig. 5: Effect of number of paths L, on the average user rate 
(M = 6A,N = i2,d=2,m = 6) 



Eig. 6: Average subspace angle (M = 64, N = 32, d = i,L = 
4, TO = 6) 
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We attempt to shed light on the stability of the proposed 
algorithms, as the number of paths in the mmWave channel, 
L, increases (where we set M = 64, N = 32, d = 2, m = 6). 
For clarity we restrict the result to the low SNR regime. 
Though a degradation in the performance of both algorithms is 
expected, as L increases. Fig. [^clearly indicates that the latter 
degradation is not quite significant. Though not visible here, 
our simulations show that this degradation is not present in 
the medium-to-high SNR region. As expected, this technique 
is best used for channels with a few paths, e.g., mmWave 
channels. 

We investigate the performance of both SED and MTQR 
in terms of average subspace angle, 0 = E[a(ri,ri)] where 
a(ri,ri) (radians) is defined as the subspace angle between 
r 1 and r 1 (implemented by computing the principal angles of 
the latter subspaces). As shown in Fig. both schemes exhibit 
a similar behavior of better estimation accuracy, as the SNR 
increases. 


Remark 5. Though the performance of Algorithm seems 
to be better. Fig. 4]|6 both suggest that this gap is quite 
narrow. Moreover, both algorithms seem to exhibit very similar 
behavior. With that in mind, and for the sake of clarify of our 
results, we opt to focus on Algorithm [T] the main object of 
investigation in this work. 


We next investigate its scalability: we scale up M and 
N (assuming N = Mj^, Iw simplicity), while keeping 
everything else fixed, i.e., d = 2,m = 6, and consequently 
ilo = 144. In doing that, we noticed that the complexity of the 
benchmark scheme 0 was prohibitively high, thus preventing 
us from investigating its scalability: we were unable to get 
any results for systems larger than 128 x 64 . On the other 
hand, both our algorithms exhibit no such problems since all 
the computations that they involve are matrix-vectors/matrix- 
matrix operations. Consequently, the complexity gap between 
Algorithm^and the benchmark increases drastically, as M, N 
grow. 

Fig 1^ clearly shows that Algorithm [T] is able to harness the 
significant array gain inherent to large antenna systems (by 
closely following the optimal performance bound, R*, with a 
small constant gap), while keeping the overhead remarkably 
small. Though the performance might not be good enough to 
offset the overhead, for the 16 x 8 case, it surely does for the 
256 X 128. Moreover, note that the gap between the optimal 
performance and Algorithm is quite small (across the entire 
SNR range) for small systems dimensions, and quite small 
even for large values of M (at high SNR). The key to this 
result is to have Mjr and N/r fixed, as M, N increase. 

We also evaluate the performance of Algorithm in a 
more realistic manner, by adopting the Spatial Channel Model 
(SCM) detailed in and modifying its parameters to 

emulate mmWave channels: the number of paths is set to 4, 
the carrier frequency to 60 GFIz, the BS/MS array is modified 
to implement ULAs, and an ’urban micro’ scenario is selected, 
where a small flo is desired. Fig. shows the average perfor¬ 
mance of such a system, with M = 64, = 32, to = 2d, 

for several values of d (each resulting in different values for 



Fig. 7: Average user-rate for different M,N (N = M/2,d = 
2,L = 4,to = 6,^0 = 144) 


40 



Fig. 8: Average user-rate of proposed schemes over SCM 
channels {M = 64, N = 32, to = 2d) 


flo)- Though both our algorithm and the benchmark exhibit 
similar performances for d = 1, this gap increases with d, e.g. 
for d = 3 this performance gap is quite significant. Moreover, 
we can clearly see that Algorithm yields a relatively high 
throughput in this realistic simulation setting (especially for 
d — 3), while still keeping the overhead at a relatively low 
level. 

Evidently, increasing to (the number of iterations for the 
Amoldi) has the effect enhancing the estimation accuracy 
(and increasing the communication overhead as well 
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The marginal improvement brought about by increasing m, 
is decreasing, and thus our simulations indicated that setting 
2d < m < 3d provides a good trade-off. 


C. Discussions 

A few remarks are in order at this stage, regarding similari¬ 
ties and differences between our two proposed algorithms. As 
discussed in Remark]^ when the communication overhead is 
normalized, both SED and MTQR exhibit a similar behavior 
and performance profile, across the entire SNR range (with a 
relatively small performance gap); indeed they can be used 
interchangeably with no change at all in the operational 
requirements. However, as this work shows, we have an 
accurate analytical description of the behavior of SED; the 
Arnoldi algorithm was adapted to the subspace estimation part 
(with some analytical performance guarantees), and BCD-SD 
to mathematically describe the decomposition algorithm. In 
contrast, MTQR is a (heuristic) variation on the original TQR, 
whose behavior we have not modeled analytically. 

One of the conclusions suggested by all the above results, 
is the fact that the low-SNR performance of the proposed 
schemes is rather poor. However, interestingly, Eigs. |4][^ un¬ 
ambiguously point out that this is the case for the benchmark 
scheme as well (ACE in ||^); one might be tempted to con¬ 
jecture at this point that this low-SNR behavior is an inherent 
aspect of mmWave channel estimation. Initial investigations 
reveal that, if more RE chains (more than r) can be employed 
during the RAID echoing phase, the low-SNR performance 
can be greatly boosted. 

VI. Conclusion 


We can rewrite the latter equation in matrix form, using the 
definitions of T^, Wm given in (|^, 

rn — QmTm T m\m+l,m Qm+l^ln T 

(31) 

where bm is the elementary vector, and = 

[Qlw^]u- We can further simplify the above, using the fact 
that — I in and — 0, 

Qln-^Qm + QlnW ^ = T m + E^n 

Using the definition of Em, we write, 

QlAQm =fm + [QlWm]u " QlWm 

_ r A 

m ^ m 

where Em = as defined in 0. 


(P2) ; Noting that fm+Em = Cm+Qln^rn, we rewrite iB 
as, 

AQm Qm^m — ]^m]m+l,7n Qm+lbm M QmQ^n')^^n 
Multiplying the latter equation by and using the fact that 
and Qmsi"^'> = 


= [fm]m+l,m qm+lbls[^^ - {IM - QmQl)Wms[^ 

Einally, the desired residual is upper bounded as, 

< {[fmU+l,m\blsi^^\f + WilM-QmQlWmsi^^Wl 

< i[fm]m+l,m\[st^]m\r + \\I M - QinQlWlW m\\l 

where the last inequality follows from ||BiB 2 a ;||2 < 

\\B^\\l.\\B^rF.\\x\\l 


We proposed an algorithm for blindly estimating the left 
and right singular subspace of a mmWave MIMO channel, by 
exploiting channel reciprocity that is inherent to TDD systems. 
Though the algorithm is a perfect match for conventional 
(large) MIMO systems, we extended it to operate under the 
constraints dictated by the hybrid analog-digital architecture, 
and showed via simulations that it is a good fit for large MIMO 
channels, with low rank, e.g., mmWave channels. Einally, our 
simulations showed that a similar performance to the ideal case 
(fully digital perfect CSI) can be achieved, with a only a few 
RE chains, thereby resulting in significant saving in energy and 
cost, over conventional MIMO systems. 
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Appendix 

A. Proof of Lemma 

(PI) ; Combining steps (2.b) and (3.a) in the SE-ARN 
procedure, we write, 

i-i-i i 

Aqi P Wl — ^ 3“ ^ Qi , ^ 

i=r i=r 


(P3) ; The proof immediately follows by noting that \\Im — 
QmQtnWF 0 and [fm]m+i,m 0, as m M, thereby 
implying that — A- 


,(M) _ UM)^(M )||2 ^ 


B. Proof of Lemma 

The proof follows from a direct application of the Bauer- 
Eike Theorem |22 Th. 7.2.2]. Let Cm — SmAmSf^ be the 
diagonalizable matrix in question, and Tm — Cm + Pm the 
“perturbed” matrix. Then, every eigenvalue A of Tm satisfies, 

|A - Ap < \\Sm\\l\\Sm^\\l\\Pm\\l = llQl^mf^ 
where A is an eigenvalue of Cm, and ||B|j 2 = cTmaxiB) is the 
vector-induced matrix 2-norm. The last equality follows from 
the fact that Sm is unitary, as discussed in Lemma Using 
the fact that ||P|j 2 < II-BIIf, we rewrite the last equation, 
|A-Ap<||Ql„IU^|||,<||^ "2„w ii2_^nw 112 

This concludes the proof. 


I FlI^mllF “ irdfWmWF 


C. Proof of Lemma 

Note that there is not loss in optimality by assuming the 
g G K+. Moreover, exploiting the structure of ho, the globally 
optimal solution can be found by optimizing for /, assuming 



14 


g IS fixed (and vice) versa, l.e., [6] O. El Ayach, S. Rajagopal, S. Abu-Surra, Z. Pi, and R. Heath, “Spatially 

A . 2/'xi £\ \ r^i /— sparse precoding in millimeter wave MIMO systems,” Wireless Com- 

f — argmin g {f f) 2^3x(/ '7i), s. t. [f\i = l/\]\d e ^ munications, IEEE Transactions on, vol. 13, pp. 1499-1513, March 


^ {(/f*} =argmax l/^fM 3? [ ) 

i'^i} \i=i J 

M 

{</'*} =argmax V 3? = {^i} 

where (a) follows from applying the one-to-one mapping 
[/], ^ l/y/M Thus, [/*], = 1/VM e^®-,Vz. 

Plugging f* into the original problem, the optimization of g 
is a simple unconstrained quadratic problem, 

g* = argmin _ 2g{\\^-^\\i/y/M) = ||7 i||i/VM (32) 

9 

D. Proof of Corrollary 

The proof consists of finding a closed-from expression for 

(t) ('rl 

Wm as a function of e) and e) , and applying the result of 
Lemma Note that Wi in ® can represent any distortion, 
and by comparing pi in both and ( |25] l, can infer that wi = 
- {l/d)We\"'\ Thus, Wm in 0 can be written 
as, 

Wm = , • • • , e«] - {l/d)H^ , • • • , 

4 - {l/d)H'<Ed'> 

Then using properties of the Frobenius norm, 

||W,„||^ < \\H\\%\\E^^'>\\f + il/d)\\H\\p\\Ed)y (33) 

On the other hand, recall that — Qi — hgi and = 
Si — WiUi- Thus, using the results of Sec. |IV-A2| 

||er>||2<llg/||2 + fell2<2 

||e|’'^||2 < \\dHfigi\\2 + \\wiUi\\2 < l + d\\H\\F 

and it follows that 

< v^(l + d\\H\\F) (34) 

The upper bound follows by combining ( [3^ and p4| ). 
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